revision: use priority queue for streaming walks#2127
Open
spkrka wants to merge 3 commits into
Open
Conversation
enumerate_and_traverse_cruft_objects() initializes a rev_info on the stack but never calls release_revisions() afterwards. This is not visible on master but becomes a leak once the revision walking machinery uses dynamically allocated structures. Add the missing release_revisions() call. Signed-off-by: Kristofer Karlsson <krka@spotify.com>
get_revision_1() dispatches to different walk strategies based on a combination of rev_info flags: reflog_info, topo_walk_info, and limited. These conditions are checked in multiple places within the function -- once to select the next commit, and again to decide how to expand parents -- and the two chains must stay in sync. Extract the mode selection into a rev_walk_mode enum and a small get_walk_mode() helper, resolved once at the top of get_revision_1(). Both dispatch sites now switch on the same mode variable, making it obvious that they agree and easier to verify that all modes are handled. No functional change. Signed-off-by: Kristofer Karlsson <krka@spotify.com>
The streaming (non-limited) walk in get_revision_1() inserts newly discovered parent commits into a date-sorted queue via commit_list_insert_by_date(), which scans the linked list to find the insertion point -- O(w) per insert, where w is the width of the active walk frontier. Replace this with an O(log w) priority queue. Add a commit_queue field to rev_info alongside the existing commits linked list. The two representations are mutually exclusive: setup and external callers that need list access use the linked list, then get_revision_1() lazily drains it into the priority queue on first call. Add a REV_WALK_NO_WALK enum value to distinguish the no_walk case (which still uses the commit list) from the streaming case. The conversion function rev_info_commit_list_to_queue() is public so callers that know they will iterate can convert early. Combined with the limit_list() priority queue change already in master, this eliminates all O(w) sorted linked-list insertion from the revision walk machinery. Signed-off-by: Kristofer Karlsson <krka@spotify.com>
58004d9 to
c0e7870
Compare
Author
|
/submit |
|
Submitted as pull.2127.git.1779897003.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a follow-up to kk/limit-list-optim (now in master), which
replaced the O(N) sorted linked-list insertion in limit_list() with
a priority queue. In the review thread for that patch, I
mentioned that the same approach could be applied to the streaming
(non-limited) walk in get_revision_1(). Junio suggested doing it
as a separate topic, and Peff noted he had a local branch
(jk/revs-commits-prio-queue) doing a broader conversion of
revs->commits to prio_queue entirely.
This series takes a lighter-weight approach: it keeps the linked
list for setup and external callers, and adds a separate
commit_queue field that the streaming walk drains into on first use.
This avoids touching bisect, line-log, and list-objects code, at
the cost of a "only one should be non-empty" invariant between the
two fields.
Together with the limit_list() change already in master, this
eliminates all commit_list_insert_by_date() callers from
revision.c.
Patch 1 is a small leak fix -- a missing release_revisions() call
in pack-objects that becomes visible once the commit queue uses a
dynamically allocated prio_queue array.
Patch 2 introduces a rev_walk_mode enum to replace the repeated
if/else chains in get_revision_1(). The function dispatches on
walk mode in multiple places (next commit, expand parents, flag
clearing) and these chains must stay in sync. The enum resolves
the mode once and both dispatch sites switch on the same variable.
This is a lighter alternative to the vtable-based refactoring I
mentioned before. No functional change.
Patch 3 is the actual conversion of the streaming walk to use a
priority queue.
== Why this helps ==
The streaming walk in get_revision_1() inserts newly discovered
parent commits into a date-sorted queue. On master, this uses
commit_list_insert_by_date(), which walks the linked list to find
the insertion point -- O(w) per insert, where w is the queue width
(active walk frontier).
In merge-heavy repositories, the walk frontier stays wide:
Repository Commits Peak width Avg width
monorepo (2.4M) 2,420K 2,653 1,700
linux.git 1,445K 581 235
git.git 82K 188 82
On the monorepo, each of the 2.4M commits requires scanning an
average of 1,700 list entries to find the insertion point. With
the priority queue, this drops to ~11 heap comparisons.
== Benchmarks ==
All benchmarks: best of 3 runs, same machine, commit-graph
present.
Streaming walks (affected by this series):
Regression checks -- non-merge-heavy repos (streaming path,
but frontier stays narrow so O(w) insertion was never the
bottleneck):
Regression checks -- other walk modes (not affected by this
series):
== Profile breakdown ==
perf profiling of rev-list --count HEAD on the monorepo shows
where the time goes:
master (17.94s):
commit_list_insert_by_date 79% 14.25s
fixed overhead (parse/lookup) 21% 3.69s
patched (3.38s):
heap ops (compare + sift) 16% 0.53s
fixed overhead (parse/lookup) 84% 2.85s
The queue maintenance itself sped up 27x (14.25s to 0.53s).
The overall 5.3x is lower because the fixed costs -- object
lookup (17%), commit-graph parsing (14%), memory allocation
(10%) -- are roughly constant between the two versions at ~3s.
This means the patch removes the dominant bottleneck entirely.
After the patch, the walk cost is dominated by irreducible
per-commit work (parsing and object lookup) which scales
linearly with commit count regardless of frontier width.